[Feature] Support mamba radix cache v0 #11214
Conversation
Co-authored-by: hanming-lu <[email protected]> Co-authored-by: hzh0425 <[email protected]> Co-authored-by: thalahors <[email protected]>
Summary of Changes

Hello @yizhang2077, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a foundational feature for supporting the Mamba radix cache (v0) within the SGLang system. The core objective is to improve the efficiency of KV cache management for models that incorporate Mamba architectures. This is achieved by implementing a specialized radix tree that handles both standard and Mamba-specific KV states, allowing better resource utilization and faster inference. The changes span memory allocation, request scheduling, and cache eviction policies, culminating in significant performance gains as evidenced by the provided benchmarks.
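To make that idea concrete, here is a minimal sketch of a radix-tree node that tracks both regular attention KV indices and a Mamba state slot. This is an illustration only; the class and field names are hypothetical and do not reflect the PR's actual MambaRadixCache implementation.

```python
# Illustrative sketch only (hypothetical names, not the PR's data structures).
# Each node keys on a chunk of token ids and remembers both the slots used in
# the full-attention KV pool and, optionally, a slot in the Mamba state pool,
# so a matched prefix can reuse either kind of cached state.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HybridRadixNode:
    token_ids: List[int] = field(default_factory=list)   # edge label (token-id chunk)
    kv_indices: List[int] = field(default_factory=list)  # slots in the MHA KV pool
    mamba_slot: Optional[int] = None                      # slot in the Mamba state pool
    children: Dict[int, "HybridRadixNode"] = field(default_factory=dict)

    def match_len(self, tokens: List[int]) -> int:
        """Number of leading tokens that match this node's edge label."""
        n = 0
        while n < len(tokens) and n < len(self.token_ids) and tokens[n] == self.token_ids[n]:
            n += 1
        return n
```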
Code Review

This pull request introduces support for the Mamba radix cache, which is a significant feature enhancement. The implementation is comprehensive, touching scheduling, memory management, and the model execution flow. The new MambaRadixCache is well-structured, and unit tests have been added. I've identified a few areas for improvement, including a potential bug in an assertion, a type hint mismatch, and the use of a magic number that should be a constant. Overall, this is a solid contribution.
```python
if self.is_hybrid_gdn:
    max_num_reqs = min(max_num_reqs, self.server_args.max_mamba_cache_size)
    # for mamba cache radix, it needs to be divided by 3 (magic number now). (yizhang2077)
    max_num_reqs = min(max_num_reqs, self.server_args.max_mamba_cache_size // 3)
```
The code uses a magic number 3 to divide max_mamba_cache_size, and the comment acknowledges this. It's better to define this as a named constant with a clear explanation of why the division is necessary; this improves code readability and maintainability. For example, MAMBA_CACHE_REQS_RATIO = 3 could be defined at the top of the file or in a constants module.
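As a concrete sketch of that suggestion (the constant name comes from the review comment above, the surrounding attribute names mirror the quoted diff, and the rationale in the comment is only a placeholder):

```python
# Named constant instead of a magic number; document the rationale once, here.
# (Placeholder explanation: the Mamba pool is sized larger than the number of
# concurrently running requests, so the request cap is derived from it by this ratio.)
MAMBA_CACHE_REQS_RATIO = 3

if self.is_hybrid_gdn:
    max_num_reqs = min(
        max_num_reqs,
        self.server_args.max_mamba_cache_size // MAMBA_CACHE_REQS_RATIO,
    )
```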
You need to fix the typo and rename the token_msg variables to token_usage_msg:
```python
class MambaRadixCache(BasePrefixCache):
    def __init__(
```
Is it compatible with MTP? The EAGLE fix should also be applied to MambaRadixCache.
Yes, I think we can do it in another PR.
During testing, I discovered that the server crashed when token_usage reached more than 0.99.
Do you have a reproduce command? I think token_usage > 0.99 is an abnormal state. (It is too large, and other models will crash in this state as well.)
Reproduce command (server: H100, tp-size=2):

The root cause is an incorrect memory-availability check for the Mamba pool. Instead of checking the Mamba pool, only the MHA pool is checked, which leads to an attempted allocation in a full Mamba pool and a subsequent server crash because mamba_pool.alloc() returns None.

Claude Sonnet's recommendations:
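The recommendations themselves are not reproduced here. As a rough illustration of the direction being discussed, not the actual SGLang API, an admission check that consults both pools before allocating might look like this (helper names are hypothetical):

```python
# Rough sketch with hypothetical names: check the Mamba pool as well as the MHA
# token pool before admitting a request, so mamba_pool.alloc() is never called
# on an exhausted pool (which would return None and crash the server).
def can_admit_request(num_new_tokens: int, mha_pool, mamba_pool) -> bool:
    has_token_room = mha_pool.available_size() >= num_new_tokens
    has_mamba_room = mamba_pool.available_size() >= 1  # one Mamba state per request
    return has_token_room and has_mamba_room
```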
@Swipe4057 The Mamba pool controls memory capacity by setting the available size to 3x.
Run the service with the command and try testing again. Model: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
Force-pushed from 72812db to 67d4e34.
@Swipe4057 I have tried it and it is OK. Could you share your server log? (error and
Enable sanity_check.
Force-pushed from 8305248 to 8022cdf.
@yizhang2077 Could you resolve the conflicts? Then we can merge it.
@yizhang2077 @hanming-lu @ispobock I retested your new code.

Launch command:
--model-path /data/models/Qwen3-Next-80B-A3B-Instruct --served-model-name Qwen3-Next-80B-A3B-Instruct --cuda-graph-max-bs 512 --cuda-graph-bs 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 168 176 184 192 200 208 216 224 232 240 248 256 264 272 280 288 296 304 312 320 328 336 344 352 360 368 376 384 392 400 408 416 424 432 440 448 456 464 472 480 488 496 504 512 --sleep-on-idle --port 8027 --host 0.0.0.0 --api-key ${SGLANG_API_KEY} --schedule-policy lof --random-seed 11111 --context-length 131072 --grammar-backend xgrammar --prefill-attention-backend flashinfer --decode-attention-backend flashinfer --tool-call-parser qwen25 --enable-metrics --quantization w8a8_fp8 --allow-auto-truncate --tp-size 2 --ep-size 2 --chunked-prefill-size 16384 --max-running-requests 1024 --mem-fraction-static 0.87 --mamba-ssm-dtype bfloat16 --mamba-full-memory-ratio 4

TEST-1 results (mr, radix enabled): (logs attached as screenshots, not reproduced here)

TEST-2 results (mr, radix enabled): (screenshots not reproduced here)

As you can see, the last test sends 8 groups of identical prompts with a length of 1000 characters, without a variable part and with only 1 token for generation. Performance with the radix cache enabled dropped dramatically!
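For reference, a minimal sketch of how that second scenario could be reproduced against the server launched above (the port, model name, and API key follow the launch command; in practice each group would likely be sent with many concurrent requests):

```python
# Minimal sketch: send groups of identical ~1000-character prompts with only
# 1 generation token, so the prefix cache should be hit on almost every request.
import requests

URL = "http://localhost:8027/v1/completions"              # port from the launch command
HEADERS = {"Authorization": "Bearer YOUR_SGLANG_API_KEY"}  # matches --api-key
prompt = "x" * 1000                                        # identical prompt, no variable part

for _ in range(8):                                         # 8 groups of identical prompts
    resp = requests.post(
        URL,
        headers=HEADERS,
        json={
            "model": "Qwen3-Next-80B-A3B-Instruct",
            "prompt": prompt,
            "max_tokens": 1,
        },
        timeout=120,
    )
    resp.raise_for_status()
```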
@yizhang2077 Try this code to fix the null cache hit:
TEST-1 results (mr + this fix, radix enabled): (screenshots not reproduced here)

TEST-2 results (mr + this fix, radix enabled): (screenshots not reproduced here)
@Swipe4057 Let's merge this PR to add the Mamba radix cache functionality. You are welcome to make changes to improve performance!
Force-pushed from 70f8e56 to c4273af.
"/data/install/backup/sglang_main/python/sglang/srt/mem_cache/memory_pool.py", line 306, in alloc (with 20 requests, --max-mamba-cache-size 100000)
Signed-off-by: Shangming Cai <[email protected]>
Co-authored-by: hanming-lu <[email protected]> Co-authored-by: hzh0425 <[email protected]> Co-authored-by: thalahors <[email protected]>
```python
if config := self.mamba2_config:
    class_name = config.__class__.__name__
    logger.warning(f"{class_name} model detected, disable radix cache")
    self.server_args.disable_radix_cache = True
```
Wait, we disabled the radix tree?
Motivation
Ref #10438. Add radix cache for Mamba; we will implement page_size > 1 and Marconi soon.
Co-authored-by: hanming-lu [email protected]
Co-authored-by: hzh0425 [email protected]
Co-authored-by: thalahors [email protected]
Modifications
ref: doc
Accuracy Tests
Benchmarking and Profiling
Checklist